Cluster analysis example

library(dplyr)
library(TeachingDemos)
library(KernSmooth)
library(MASS)
library(lattice)
library(ggplot2)
library(RColorBrewer)
library(viridis)
library(factoextra)
library(NbClust)

Analysis of demographic data in the EU

We continue to analyze the dataset in EUdemo.txt (see the previous example on PCA). In this study we aim to explore whether countries can be grouped into clusters portraying different demographic situations.

Remember that one row corresponds to the centroid of the first 15 EU countries.

  1. Remove it so that the analysis is carried out on individual countries only.
x <- read.table("../../EUdemo.txt", ..., ...)
n <- nrow(x)
p <- ncol(x)
ind_EU <- ...
x <- ...
# x <- x %>% slice(-ind_EU) # rownames are lost!
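A minimal base-R sketch of this step (toy frame; the row label "EU15" is hypothetical, standing for the centroid row):

```r
# toy frame; "EU15" (hypothetical label) plays the role of the centroid row
xt <- data.frame(NAT = c(10.9, 9.8, 11.2), MOR = c(10.1, 9.9, 10.3),
                 row.names = c("EU15", "Austria", "Belgio"))
ind <- which(rownames(xt) == "EU15")
xt <- xt[-ind, ]           # base-R subsetting keeps the rownames
rownames(xt)               # "Austria" "Belgio"
```

Unlike dplyr::slice, base-R indexing preserves the row names, which we need as country labels in later plots.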

Data representation (What features should be used?)

When our information consists of a profile matrix, an important step in cluster analysis is the choice of variables. E.g., we can use the original variables or a transformation of them, ranging from a simple standardization to PC scoring or other possibilities.

Let’s decide whether to use raw or transformed data for clustering. To this aim, look at the distribution of the raw variables.

x %>% summary
      NAT               MOR              CR.N              MIG          
 Min.   : 0.2254   Min.   : 5.127   Min.   :-6.3696   Min.   :-10.9020  
 1st Qu.: 9.4017   1st Qu.: 8.778   1st Qu.:-1.1754   1st Qu.: -0.1701  
 Median :10.4273   Median : 9.889   Median : 0.4460   Median :  0.9284  
 Mean   :10.8896   Mean   :10.063   Mean   : 0.8267   Mean   :  1.1903  
 3rd Qu.:12.4762   3rd Qu.:10.890   3rd Qu.: 3.5132   3rd Qu.:  2.4504  
      CR.T              M.IN             VEC              N.FI      
 Min.   :-15.754   Min.   : 2.400   Min.   : 17.00   Min.   :1.100  
 1st Qu.: -1.207   1st Qu.: 4.600   1st Qu.: 60.65   1st Qu.:1.290  
 Median :  1.826   Median : 5.850   Median : 78.45   Median :1.430  
 Mean   :  2.017   Mean   : 8.955   Mean   : 74.00   Mean   :1.509  
 3rd Qu.:  4.838   3rd Qu.:11.125   3rd Qu.: 88.60   3rd Qu.:1.732  
      E.P              NU              E.M             V.M       
 Min.   :24.70   Min.   : 3.900   Min.   :21.70   Min.   :59.90  
 1st Qu.:26.60   1st Qu.: 4.800   1st Qu.:23.30   1st Qu.:68.28  
 Median :28.35   Median : 5.250   Median :25.95   Median :72.90  
 Mean   :28.02   Mean   : 5.673   Mean   :25.65   Mean   :71.56  
 3rd Qu.:29.32   3rd Qu.: 6.225   3rd Qu.:27.60   3rd Qu.:75.03  
      V.F       
 Min.   :71.20  
 1st Qu.:75.35  
 Median :79.05  
 Mean   :78.26  
 3rd Qu.:81.03  
 [ reached getOption("max.print") -- omitted 1 row ]
x %>% var %>% diag %>% sqrt
       NAT        MOR       CR.N        MIG       CR.T       M.IN 
 2.9578658  2.1624788  4.1705540  3.6036389  6.3627741  7.3494793 
       VEC       N.FI        E.P         NU        E.M        V.M 
24.7585519  0.3060433  1.7561905  1.5664737  2.3358947  4.6820333 
       V.F 
 3.2786714 
  1. Finally, we decide to standardize the variables.
x <- ...
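One common choice is scale(), which centers each column and divides it by its standard deviation; a quick sketch on toy data:

```r
# toy frame standing in for the demographic profile matrix
xt <- data.frame(NAT = c(10, 12, 8, 14), MOR = c(9, 10, 11, 8))
xs <- as.data.frame(scale(xt))   # each column: mean 0, sd 1
round(colMeans(xs), 10)          # 0 0
apply(xs, 2, sd)                 # 1 1
```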

Here is the x dataset after processing (in the HTML version, scroll it both horizontally and vertically).

NAT MOR CR.N MIG CR.T M.IN VEC N.FI E.P NU E.M V.M V.F
Austria -0.4188946 -0.1849450 -0.2011949 0.3486411 0.0655819 -0.6333782 0.7271225 -0.6167428 0.0455531 -0.4931459 0.5790073 0.7560818 0.8364669
Belgio 0.0938089 0.0946059 0.0174775 0.1065843 0.0718212 -0.4973141 0.8482927 0.3308355 0.2733189 -0.8761718 0.1937159 0.6920070 0.8669670
Danimarca 0.5234690 0.4901284 0.1171209 0.1598004 0.1672732 -0.6469846 0.2666755 0.7229369 0.8996746 0.6559319 1.7348813 0.5638576 0.2264637
Finlandia 0.0883630 -0.2398760 0.1870477 -0.1476391 0.0389853 -0.7286230 0.2989876 0.7556120 0.8996746 -0.6208212 0.8786783 0.4784246 0.8364669
Francia 0.5750751 -0.4155859 0.6233442 -0.0955218 0.3544782 -0.5653462 0.4161188 0.8536374 0.7288503 -0.5569835 0.8358682 0.7133653 1.2634691
Germania -0.5079145 0.1133973 -0.4190241 0.3528009 -0.0748409 -0.5789526 1.0785162 -0.4860423 0.3302603 -0.3016329 0.5361971 0.6279323 0.6839661
Grecia -0.4078752 -0.1316862 -0.2209948 0.3283008 0.0410838 -0.4156757 1.5995483 -0.6820930 0.3872017 0.1452307 0.3649565 0.8201565 0.6534659
Irlanda 1.1162565 -0.7503568 1.1807467 1.1474068 1.4237826 -0.4701013 -0.9167539 1.2130637 1.4121475 -0.4931459 1.0927291 0.4997829 0.2264637
Italia -0.5310467 -0.0697143 -0.3404846 0.1578244 -0.1337887 -0.5109205 2.0398002 -0.9434940 1.1274403 -0.5569835 0.8358682 0.9483059 1.1719686
Lussemburgo 0.6963896 -0.5899557 0.7997963 2.6856328 2.0452784 -0.8238679 0.0566471 0.7229369 0.7857917 -0.5569835 0.7502479 0.6706488 0.8974672
Paesi Bassi 0.6032157 -0.5444131 0.7101006 0.4383851 0.7137287 -0.5109205 -0.0362501 0.4615360 1.2982646 0.0175554 0.8786783 0.7987982 0.6839661
Portogallo 0.2446528 0.3419498 -0.0037906 -0.0524779 -0.0322061 -0.4564949 0.7109665 -0.0612658 0.3302603 0.7836072 0.0652855 0.0939763 0.2569638
Regno Unito 0.2964898 0.2586459 0.0761674 0.4349457 0.2962619 -0.4292821 0.3191826 0.5595613 0.2163774 -0.3654706 0.5790073 0.7347235 0.4704650
Spagna -0.4412617 -0.3080667 -0.1532184 -0.0486988 -0.1280100 -0.5517398 1.4783781 -1.0088442 1.4690889 -0.3016329 0.7930580 0.7347235 1.1414685
Svezia -0.3152388 0.2904594 -0.3741818 0.0988957 -0.1892510 -0.7558359 0.7877076 -0.0285907 1.0135575 -1.0676847 1.7776915 1.1832466 1.1109683
Albania -3.6053965 -2.2824675 -1.3735563 -1.3856389 -1.6850878 2.9587130 -2.2294317 3.5656719 0.1594360 0.9751201 -1.1762089 -0.6535622 -0.8715421
Bielorussia -0.5682678 1.8487202 -1.3616126 -3.3555729 -2.7929566 0.3462830 -0.1533813 -0.7147681 -1.7765726 0.9751201 -1.4758799 -1.9991314 -1.2985443
Bosnia-Erzegovina 0.7486665 -1.1602882 1.1325962 -1.5699589 -0.1467931 0.2782510 -1.8376479 0.1674600 -1.1502169 0.4005813 -1.0049683 -0.3972633 -0.9325424
Bulgaria -0.7042754 1.6438904 -1.3518662 -0.3302932 -1.0731622 0.8089008 1.1229453 -0.9108188 -1.8904555 -0.8761718 -0.9193480 -0.6962787 -0.9630425
Cipro 0.6277560 -1.1536278 1.0433899 -0.5513626 0.3716307 -0.4020693 -1.0258072 1.0823632 0.3302603 4.5500286 -0.0203348 0.7987982 0.6534659
Croazia -0.1682199 0.6827745 -0.4733322 -0.5996902 -0.6498934 -0.1707604 -0.4643850 -0.4206921 -0.1252711 -0.3016329 -0.2343856 -0.3545468 -0.4445398
Estonia -0.7518377 1.2774651 -1.1956030 -0.4264828 -1.0252159 0.0877613 0.2666755 -0.8781437 -0.8085683 -1.1315224 -0.4912465 -1.2943095 -0.5970406
Islanda 1.3161916 -1.4854988 1.7037253 0.7702909 1.5529906 -0.8918999 -0.9854171 1.5724899 0.3872017 -0.0462823 1.7776915 1.3327543 0.9889677
Iugoslavia 0.5205388 0.2221207 0.2540076 -0.3433779 -0.0279842 0.6592304 -0.4603460 0.8536374 -0.6946855 -0.2377953 -0.6196769 -0.3331886 -1.0240429
Lettonia -0.9843885 1.6031229 -1.5293913 -0.5471147 -1.3123227 0.3598894 0.3312997 -1.1395446 -0.6946855 -1.1315224 -0.6196769 -1.4438172 -0.8715421
Lituania -0.3552825 0.3463268 -0.4315499 -0.2327850 -0.4147052 -0.0483027 -0.2624346 -0.5187175 -0.8655097 -0.5569835 -1.0905886 -0.9739358 -0.3225392
Macedonia 1.2358387 -0.7605571 1.2708465 -0.6067469 0.4893521 0.7816880 -1.3166158 0.8209623 -1.0363340 0.8474448 -1.0049683 -0.2477556 -1.1460435
Malta 0.3490564 -0.9779687 0.7546477 0.1108205 0.5574072 -0.2387924 -0.6663354 0.9843378 0.5010846 0.4005813 0.4505768 0.5211411 0.5619654
Moldova -0.2512903 0.6210481 -0.5002419 -0.7251229 -0.7385720 1.2170930 -1.4539421 0.5268862 -1.5488069 0.5282566 -1.6899307 -1.5719666 -2.0610483
Norvegia 0.8115669 0.0310710 0.5594739 0.8513534 0.8488886 -0.6878038 0.0929982 1.0823632 0.7288503 -0.2377953 1.0927291 0.8628730 0.8669670
Polonia -0.3410263 -0.0913270 -0.1945107 -0.4307829 -0.3714738 -0.0210899 -0.4966971 -0.4533672 -0.4669197 0.0175554 -1.0049683 -0.7176369 -0.3225392
Repubblica ceca -0.7392766 0.2839451 -0.6715430 -0.0928359 -0.4927494 -0.5925590 0.3676507 -1.2375700 -0.6377440 -0.3016329 -0.6624871 -0.0341732 -0.0175376
Romania -0.1521408 0.8039003 -0.5247334 -0.3611646 -0.5484928 1.2987315 -0.1129913 -0.6820930 -1.3779826 0.3367436 -1.0477784 -0.9525776 -1.2680441
Russia -0.8677057 2.1409561 -1.7255093 -0.0237870 -1.1444772 1.0810290 -0.2260835 -1.1068695 -1.3210412 0.3367436 -1.5186901 -2.4903710 -1.7865469
San Marino 0.1457497 -1.1633592 0.7065842 2.8111854 2.0552898 -0.7694423 1.3047007 -0.6820930 2.3801518 1.9326849 1.1783494 1.8880686 1.3244694
Slovacchia -0.1604934 -0.1628653 -0.0293786 -0.2582975 -0.1655466 -0.1027284 -0.6663354 -0.5840677 -0.9224512 -0.3654706 -1.0477784 -0.5467710 -0.3225392
Slovenia -0.6980148 -0.1760062 -0.4037891 1.2230198 0.4280048 -0.6469846 0.4847820 -0.9761691 -0.0113883 -1.1315224 0.2793362 0.0512598 0.3179642
Svizzera 0.0286267 -0.6077510 0.3354284 0.6407987 0.5827850 -0.5925590 0.5494061 -0.0939410 0.9566160 0.0175554 0.8786783 1.1191719 1.2939693
Turchia 3.4495426 -1.7624610 3.3603616 -0.1510413 2.1170437 3.9383742 -2.3021338 -1.3355953 -0.8085683 0.3367436 -1.3046393 -1.0807270 -2.1525488
Ungheria -0.5014068 1.9242497 -1.3533557 -0.3302932 -1.0741385 -0.0755156 0.4645869 -0.7147681 -0.5238612 -0.7484965 -0.6196769 -1.1020853 -0.9325424

From a preliminary to a more focused exploration for assessing clustering tendency

Icons

A preliminary exploration of the similarity between countries can be done by icon graphics.

  1. Let’s see the stars graph, both versions: “star” and “colorful wheel”.
stars(x, scale = TRUE, key.loc = NULL, draw.segments = FALSE, ...)

See ?stars
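Since stars belongs to base R graphics, a self-contained toy sketch is possible (mtcars rows stand in for countries; the key.loc placement is an arbitrary choice):

```r
# toy sketch with base graphics: "star" and "colourful wheel" versions
m <- mtcars[1:8, 1:6]                     # stand-in for the country profiles
stars(m, scale = TRUE, key.loc = c(10, 2))                        # stars
stars(m, scale = TRUE, draw.segments = TRUE, key.loc = c(10, 2))  # wheels
```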

Another icon graph is given by the Chernoff faces.

  1. Let’s see the faces routine of TeachingDemos pkg.
faces(xy, scale = TRUE, ...)

See ?faces

Parallel coordinates

A graph that gives a dual view of the data matrix is the parallel coordinates plot.

Let’s look at various routines for parallel plots in the MASS and lattice pkgs and, moreover, in the iplots pkg for interactive graphics.

  1. See ?MASS::parcoord, ?lattice::parallelplot, ?iplots::ipcp.
parcoord(x)
parcoord(x, col = 1:n)

parallelplot(x)

# Loading iplots is unnecessary if you load JGR, which is recommended when
# working with iplots
library(iplots)
ipcp(x)

library(JGR)
JGR()
# type commands in the editor window of JGR and press the 'Enter' key, e.g.,
# an interactive parallel coordinates plot
ipcp(x)
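If an interactive setup is unavailable, a static sketch with only MASS (shipped with R) and toy data may help:

```r
library(MASS)
# one polyline per row, coloured by row; toy matrix instead of the EU data
m <- scale(mtcars[1:10, 1:5])
parcoord(m, col = 1:10)
```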

Plots on reduced data

The graph of the scores on the first principal components can give a view of the similarity between countries in a reduced space.

  1. Let’s obtain the first two PCs by using prcomp or princomp and plot the cloud of scores. See ?prcomp, ?princomp.
pcdemo <- prcomp(x)
pc <- pcdemo$x[, 1:2]
plot(pc)
...
...
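On synthetic data (the names pca_toy and pct are illustrative), the two steps look like this:

```r
set.seed(1)
xt <- matrix(rnorm(40), nrow = 10, ncol = 4)   # toy profile matrix
pca_toy <- prcomp(xt)                          # centred PCA
pct <- pca_toy$x[, 1:2]                        # scores on the first two PCs
plot(pct)
```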

We can get a clearer view of any clustering structure by estimating the density of the bivariate distribution of the first 2 PCs.

Let’s use the bkde2D routine in the KernSmooth pkg and then use contour or persp for plotting the density.

est <- bkde2D(pc, bandwidth = c(0.5, 0.5), gridsize = c(20, 20))
contour(est$x1, est$x2, est$fhat, main = "Empirical 2D-density ...", col = 4)
persp(est$fhat, main = "... of the first 2 PC", col = "green3")

How many modes do you see?
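As a sketch of what a clearly bimodal estimate looks like, here is bkde2D on two synthetic blobs (toy data, not the PC scores):

```r
library(KernSmooth)                 # recommended package, ships with R
set.seed(2)
blobs <- cbind(c(rnorm(50, -2), rnorm(50, 2)), rnorm(100))
est2 <- bkde2D(blobs, bandwidth = c(0.5, 0.5), gridsize = c(20, 20))
contour(est2$x1, est2$x2, est2$fhat)   # two bumps expected
```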

We can also use the composite plotting code offered by the ggplot2 library.

pc <- pc %>% as.data.frame
p1 <- ggplot(pc, aes(x = PC1, y = PC2)) + stat_density2d(aes(fill = ..density..), 
    geom = "tile", contour = F)
p2 <- p1 + scale_fill_distiller(palette = "RdGy")
p3 <- p1 + scale_fill_viridis(option = "inferno")
ggpubr::ggarrange(p1, p2, p3, ncol = 3)

Let’s turn back to the parallel coordinates:

  1. (a) detect the possible anomalous profile of Turchia; (b) detect how many components are useful to discriminate between units (and, indirectly, choose the number of components to keep).

Ordered Dissimilarity Matrix

  1. Compute the dissimilarity matrix (DM) between the objects in the data set using the Euclidean distance measure.
  2. Reorder the DM so that similar objects are close to one another. This process creates an ordered dissimilarity matrix (ODM).

See ?fviz_dist in factoextra library.

fviz_dist(dist(x), ...)

  1. Can you detect a clustering tendency from the ODM graph?
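If factoextra is not available, an ODM can be sketched in base R by reordering the distance matrix with the leaf order of a dendrogram (toy data):

```r
d0 <- dist(scale(mtcars[1:10, ]))    # step 1: Euclidean dissimilarity matrix
ord <- hclust(d0)$order              # step 2: leaf order of a dendrogram
odm <- as.matrix(d0)[ord, ord]       # ordered dissimilarity matrix
image(odm, main = "ODM (base-R sketch)")
```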

Hierarchical methods

Hierarchical methods need a distance matrix as input. Thus, the first decision concerns the type of pair-wise distance. We consider the most popular one: the Euclidean distance.

Let’s see the dist routine and choose the Euclidean distance for computing the distance matrix.

d <- dist(x)
d
d %>% as.matrix
x[1:2, ]
                NAT         MOR        CR.N       MIG       CR.T
Austria -0.41889459 -0.18494502 -0.20119485 0.3486411 0.06558189
Belgio   0.09380886  0.09460592  0.01747745 0.1065843 0.07182119
              M.IN       VEC       N.FI        E.P         NU       E.M
Austria -0.6333782 0.7271225 -0.6167428 0.04555314 -0.4931459 0.5790073
Belgio  -0.4973141 0.8482927  0.3308355 0.27331887 -0.8761718 0.1937159
              V.M       V.F
Austria 0.7560818 0.8364669
Belgio  0.6920070 0.8669670
x[1:2, ] %>% dist
        Austria
Belgio 1.315535
d[1]
[1] 1.315535
as.matrix(d)[1:7, 1:7] %>% round(2)
          Austria Belgio Danimarca Finlandia Francia Germania Grecia
Austria      0.00   1.32      2.70      1.91    2.23     0.68   1.19
Belgio       1.32   0.00      2.53      1.31    1.52     1.36   1.78
Danimarca    2.70   2.53      0.00      1.89    2.15     2.57   2.80
Finlandia    1.91   1.31      1.89      0.00    0.94     1.97   2.40
Francia      2.23   1.52      2.15      0.94    0.00     2.41   2.65
Germania     0.68   1.36      2.57      1.97    2.41     0.00   0.85
Grecia       1.19   1.78      2.80      2.40    2.65     0.85   0.00
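A quick sanity check that dist defaults to the Euclidean distance (a 3-4-5 right triangle):

```r
pts <- rbind(a = c(0, 0), b = c(3, 4))
d45 <- dist(pts)                 # method = "euclidean" is the default
as.numeric(d45)                  # sqrt(3^2 + 4^2) = 5
```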

As regards the type of algorithm, we consider a bottom-up (agglomerative) hierarchical algorithm. Let’s see the best-known methods for the distance between groups.

Nearest-neighbour or single linkage method

ls <- hclust(d, method = "single")
ls %>% names
[1] "merge"       "height"      "order"       "labels"      "method"     
[6] "call"        "dist.method"
ls %>% plot

  1. Comment on the form of the clusters and on the information generally provided by the nearest-neighbour linkage.

Let’s try to identify the best level at which to cut the dendrogram by looking for the highest separation between clusters. Use rect.hclust to highlight the best partition.

m <- length(ls$height)
m
[1] 39

s <- c()
for (j in m:2) s <- c(s, ls$height[j]/ls$height[j - 1])
length(s)
[1] 38


max(s)
[1] 1.472332

b <- (m:2)[s == max(s)]
b
[1] 35

ls$height[b]/ls$height[b - 1]
[1] 1.472332

k <- (2:m)[s == max(s)]
k
[1] 6

rect.hclust(ls, k = k, border = 4)

It is wise to also detect the 2nd-best cut, and so on (stop where you choose).

The second-best is

k2 <- (2:m)[s == max(s[-(k - 1)])]
k2
[1] 3

The 3-partition, the most relevant after the one shown above, separates Turchia, and then Albania, from the rest of Europe. Again, single linkage makes anomalous units stand out.
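The height-ratio rule used above can be checked on a deterministic toy configuration, where the best cut is unambiguously k = 2 (two tight 1-D groups far apart):

```r
xt <- cbind(c(1:5, 101:105), 0)        # within-merges at height 1, last at 96
hs <- hclust(dist(xt), method = "single")
m <- length(hs$height)
ratio <- hs$height[m:2] / hs$height[(m - 1):1]  # vectorised version of s
k <- (2:m)[which.max(ratio)]
k                                       # 2
```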

Furthest-neighbour or complete linkage method

lc <- hclust(d)
plot(lc)
  1. Comment on the form of the clusters and on the information generally provided by the furthest-neighbour linkage.
s <- c()
for (j in m:2) s <- c(s, lc$height[j]/lc$height[j - 1])

k <- (2:m)[s == max(s)]
k
[1] 30
k2 <- (2:m)[s == max(s[-(k - 1)])]
k3 <- (2:m)[s == max(s[-c(k - 1, k2 - 1)])]
k4 <- (2:m)[s == max(s[-c(k - 1, k2 - 1, k3 - 1)])]
k5 <- (2:m)[s == max(s[-c(k - 1, k2 - 1, k3 - 1, k4 - 1)])]
k2, k3, k4, k5:  3, 12, 4, 6

The 3-partition appears to be relevant: on one side Albania, Moldova, etc.; on the other Cipro, … EU …, Norvegia, San Marino, Islanda …; and, distant from all countries, Turchia.

Continue to analyse the hierarchical clustering with Average linkage, Centroid method, and Ward method.

Average Linkage method

lm <- hclust(d, method = "ave")

k, k2, k3:  6, 3, 9

  1. Compare the results of the average linkage method to the previous ones.

Centroid method

lcen <- hclust(d, method = "cen")

k, k2, k3:  6, 3, 30

  1. What do you notice about the distances at which fusions occur? Do you note any similarity with previous results?

Ward method

k, k2, k3:  2, 3, 12

Let’s conclude by summarising what each method tells us and which cluster number(s) are most promising.

    k1 k2 k3 k4 k5
sl   6  3 31  2  7
cl  30  3 12  4  6
al   6  3  9  8 30
cen  6  3 30  8  9
w    2  3 12 30  8

Partitional methods

The kmeans routine

We consider the most popular partitional method: K-means. We need to fix the number of clusters a priori. Let’s start with k=6.

We can take a 6-cluster partition previously obtained by hierarchical clustering, or we can let R choose the initial set at random.

See ?cutree.

kmeans(x, centers, iter.max = 10, nstart = 1, ...)

See ?kmeans.
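A toy sketch of the first strategy (synthetic blobs, not EUdemo; the names are illustrative): the centroids of a cutree partition serve as the kmeans starting centers.

```r
set.seed(4)
xt <- as.data.frame(rbind(matrix(rnorm(20, 0), 10), matrix(rnorm(20, 5), 10)))
cl <- cutree(hclust(dist(xt)), k = 2)        # hierarchical 2-partition
init <- aggregate(xt, list(cl), mean)[, -1]  # its centroids
km <- kmeans(xt, centers = init)             # seeded (non-random) start
km$size
```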

  1. Plot the first two PC scores and colour points according to the initial clusters of membership.
lm_cl6 <- cutree(lm, k = 6)

plot(pc, ...)
...
...

Then, we run kmeans using the initial set derived above.

init6 <- aggregate(x, list(lm_cl6), mean)
init6
  Group.1         NAT        MOR        CR.N         MIG         CR.T
1       1 -0.00141097  0.1289484 -0.06786191  0.07521229 -0.001883425
2       2 -3.60539653 -2.2824675 -1.37355634 -1.38563891 -1.685087825
3       3 -0.56826784  1.8487202 -1.36161257 -3.35557293 -2.792956587
4       4  0.62775602 -1.1536278  1.04338993 -0.55136257  0.371630744
5       5  0.14574969 -1.1633592  0.70658422  2.81118544  2.055289802
        M.IN        VEC        N.FI         E.P         NU         E.M
1 -0.1734817  0.1258872 -0.05473082 -0.00813449 -0.2505628  0.07996324
2  2.9587130 -2.2294317  3.56567194  0.15943601  0.9751201 -1.17620886
3  0.3462830 -0.1533813 -0.71476814 -1.77657263  0.9751201 -1.47587991
4 -0.4020693 -1.0258072  1.08236319  0.33026030  4.5500286 -0.02033482
5 -0.7694423  1.3047007 -0.68209303  2.38015179  1.9326849  1.17834937
          V.M         V.F
1  0.02990154  0.06699142
2 -0.65356220 -0.87154206
3 -1.99913144 -1.29854430
4  0.79879825  0.65346592
5  1.88806858  1.32446943
 [ reached getOption("max.print") -- omitted 1 row ]

km6 <- kmeans(x, init6[, -1], 100)
names(km6)
[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"      

Play with the components of the output to discover their content.

km6$iter
km6$ifault
km6$size
km6$centers
km6$tot.w

fitted(km6)
fitted(km6, method = "classes")
  1. Now, plot the first two PC scores and colour points according to the obtained partition.
plot(pc, ...)
...
...

The 6-partition obtained by kmeans is significantly different from that obtained by the hierarchical method.

Remember that outliers are likely present in our data. Try larger cluster numbers to isolate them.

In the case of k=6, cluster sizes and outliers are respectively:

km6 ...
... ...
[1] 14  1 14  3  7  1
[1] "Albania" "Turchia"

The kmeans function has an nstart option that attempts multiple initial configurations and reports the best one. This approach is often recommended.

km6best <- kmeans(x, 6, 100, 25)

Is the final clustering (right panel) the same as the previous grouping (left panel)?

Sizes and outliers are

[1]  8  1 14  6  1 10
[1] "Albania" "Turchia"
km6$tot.w
[1] 144.0562
km6best$tot.w
[1] 142.8716
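Such comparisons rest on the decomposition totss = betweenss + tot.withinss, which holds for any kmeans fit; a toy check:

```r
set.seed(5)
xt <- matrix(rnorm(100), 50, 2)
km <- kmeans(xt, centers = 4, iter.max = 100, nstart = 25)
c(total = km$totss, between = km$betweenss, within = km$tot.withinss)
```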

In any case, try different cluster numbers and evaluate whether the results vary by changing initializations.

E.g., let’s try a larger cluster number, k=9.

lm_cl9 <- cutree(lm, k = 9)
init9 <- aggregate(x, list(lm_cl9), mean)
km9 <- kmeans(x, init9[, -1], 100)

Has increasing the cluster number produced the expected result?

Cluster sizes and outliers are respectively:

[1] 14  6  1  3  3 10  1  1  1
[1] "Albania"    "Cipro"      "San Marino" "Turchia"   

The comparison between the 9-partition obtained from the tree-derived initialization and the best 9-partition selected among multiple random initializations is below. Is the final clustering (right panel) the same as the previous grouping (left panel)?

Sizes and outliers are

[1]  1  5  1  1  1 11  1  9 10
[1] "Albania"     "Bielorussia" "Cipro"       "San Marino"  "Turchia"    
km9$tot.w
[1] 88.79624
km9best$tot.w
[1] 89.9779

However, the classification appears rather unstable.

Let’s try k=3.

lm_cl3 <- cutree(lm, k = 3)
init3 <- aggregate(x, list(lm_cl3), mean)
km3 <- kmeans(x, init3[, -1], 100)

Greater stability is obtained by choosing k=3.

km3best <- kmeans(x, 3, 100, 25)

  1. Can you notice any coincidence with the partitions obtained by hierarchical clustering?

As regards the possible influence of outliers, try different distance methods (e.g. the Manhattan distance). Remember that kmeans uses only the Euclidean distance. (Refer to the cluster library for more algorithms and methods.)
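A sketch of swapping the Manhattan distance into the hierarchical step (toy data; the kmeans step itself remains Euclidean):

```r
dman <- dist(scale(mtcars[1:10, ]), method = "manhattan")
hman <- hclust(dman, method = "average")   # any linkage accepts this d
plot(hman)
```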

More on cluster number selection

In order to select the best k(s), let’s consider one of the several indexes measuring the quality of a clustering.

E.g., we consider the ratio of between-cluster deviance to total deviance, which is a measure of the separation between clusters.

A possible procedure is

kmax <- 30

for (g in 2:kmax) assign(paste0("km", g), kmeans(x, g, 100, 10000))
betweenss <- vector(l = kmax - 1)
for (g in 2:kmax) betweenss[g - 1] <- get(paste0("km", g))[["betweenss"]]

totss <- km2$totss

plot(2:kmax, betweenss/totss, type = "b", pch = 20, xlab = "cluster #", cex = 0.65)

Can you see a “knee” in the plot?
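The assign/get loop above can also be written without global side effects, e.g. with sapply (same computation, sketched on toy data):

```r
set.seed(6)
xt <- matrix(rnorm(200), 100, 2)
kmax <- 10
ratio <- sapply(2:kmax, function(g)
  with(kmeans(xt, g, iter.max = 100, nstart = 10), betweenss / totss))
plot(2:kmax, ratio, type = "b", pch = 20, xlab = "cluster #")
```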

The package NbClust provides 30 indexes for determining the optimal number of clusters in the data.

NbClust(data = NULL, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 15, method = NULL, index = "all", ...)
  1. Run NbClust separately for each index and plot the results (the overall run cannot be carried out because of indefinite results for some indexes).

Krzanowski and Lai (1988)

nc <- NbClust(x, method = "kmeans", index = "kl")
nc
$All.index
      2       3       4       5       6       7       8       9      10 
 2.5776  2.4954  0.3920  2.6342  0.8426  1.7706  0.4228 38.5523  0.0425 
     11      12      13      14      15 
 1.1385  7.1692  0.1656  4.9314  1.7001 

$Best.nc
Number_clusters     Value_Index 
         9.0000         38.5523 

$Best.partition
          Austria            Belgio         Danimarca         Finlandia 
                2                 2                 2                 2 
          Francia          Germania            Grecia           Irlanda 
                2                 2                 2                 5 
           Italia       Lussemburgo       Paesi Bassi        Portogallo 
                2                 5                 5                 2 
      Regno Unito            Spagna            Svezia           Albania 
                2                 2                 2                 4 
      Bielorussia Bosnia-Erzegovina          Bulgaria             Cipro 
                3                 9                 3                 1 
          Croazia           Estonia           Islanda       Iugoslavia  
                6                 3                 5                 9 
         Lettonia          Lituania         Macedonia             Malta 
                3                 6                 9                 5 
          Moldova          Norvegia           Polonia   Repubblica ceca 
                9                 5                 6                 6 
          Romania            Russia        San Marino        Slovacchia 
                6                 3                 8                 6 
         Slovenia          Svizzera           Turchia          Ungheria 
                2                 2                 7                 3 

Calinski and Harabasz (1974)

Hartigan (1975)

Sarle (1975); Scott and Simons (1971); Marriot (1971); Milligan and Cooper (7 and 8) (1975); Friedman and Rubin (9 and 10) (1967) NO

Hubert and Levin (1976)

Davies and Bouldin (1979)

Rousseeuw (1987)

Duda and Hart (1973)

Duda and Hart (1973)

Beale (1969)

Warning in pf(beale, pp, df2): NaNs produced

Ratkowsky and Lance (1978)

Ball and Hall (1965)

Milligan (1981)

Tibshirani et al. (2001)

Frey and Van Groenewoud (1972)

McClain and Rao (1975)

Baker and Hubert (1975)

Rohlf (1974) Milligan (1981)

Dunn (1974)

Hubert and Arabie (1975) NO

Halkidi et al. (2000)

Lebart et al. (2000) NO

Halkidi and Vazirgiannis (2001)

Numbers of clusters chosen by 20 criteria:

Results

Let’s show the results for the selected clustering partitions.

From the above internal validation, the 2-, 3- and 9-group partitions appear interesting. We show the results for the 2- and 3-group clusterings.

Cluster size and units of the best 2-group partition are respectively:

km2$size
[1] 18 22
for (h in 1:2) cat("cluster ", h, "\n", row.names(x)[km2$cluster == h], "\n")
cluster  1 
 Albania Bielorussia Bosnia-Erzegovina Bulgaria Croazia Estonia Iugoslavia  Lettonia Lituania Macedonia Moldova Polonia Repubblica ceca Romania Russia Slovacchia Turchia Ungheria 
cluster  2 
 Austria Belgio Danimarca Finlandia Francia Germania Grecia Irlanda Italia Lussemburgo Paesi Bassi Portogallo Regno Unito Spagna Svezia Cipro Islanda Malta Norvegia San Marino Slovenia Svizzera 

Cluster size, units, centroids of the best 3-group partition are respectively:

km3$size
[1] 21 14  5
for (h in 1:3) cat("cluster ", h, "\n", row.names(x)[km3$cluster == h], "\n")
cluster  1 
 Austria Belgio Danimarca Finlandia Francia Germania Grecia Irlanda Italia Lussemburgo Paesi Bassi Portogallo Regno Unito Spagna Svezia Islanda Malta Norvegia San Marino Slovenia Svizzera 
cluster  2 
 Bielorussia Bulgaria Croazia Estonia Iugoslavia  Lettonia Lituania Moldova Polonia Repubblica ceca Romania Russia Slovacchia Ungheria 
cluster  3 
 Albania Bosnia-Erzegovina Cipro Macedonia Turchia 
         NAT        MOR       CR.N        MIG       CR.T       M.IN
1  0.1699365 -0.2869012  0.2692849  0.5867785  0.5088355 -0.5906152
2 -0.4303623  0.9388805 -0.7920443 -0.5755429 -0.8451207  0.3462830
3  0.4912815 -1.4238803  1.0867276 -0.8529497  0.2292291  1.5109914
         VEC       N.FI        E.P         NU        E.M        V.M
1  0.4978607  0.1752398  0.7695228 -0.1739576  0.8358682  0.7662524
2 -0.1245313 -0.5700641 -0.9753254 -0.2469149 -0.9315794 -1.0364850
3 -1.7423273  0.8601724 -0.5010846  1.4219837 -0.9022239 -0.3161020
         V.F
1  0.7943476
2 -0.8737206
3 -0.8898422

Appendix

Some graphics introduced in the exploratory phase are presented again after carrying out the cluster analysis.

Matilde Trevisani

2018-12-06